26 research outputs found

    Exploiting Heterogeneous Parallelism With the Heterogeneous Programming Library

    Get PDF
    [Abstract] While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper, we present the Heterogeneous Programming Library (HPL), which permits the development of heterogeneous applications addressing both portability and programmability while not sacrificing high performance. This is achieved by means of an embedded language and data types provided by the library, with which generic computations to be run on heterogeneous devices can be expressed. A comparison with OpenCL in terms of programmability and performance shows that both approaches offer very similar performance, while highlighting the programmability advantages of HPL.
    This work was funded by the Xunta de Galicia under the project “Consolidación e Estructuración de Unidades de Investigación Competitivas” 2010/06 and by MICINN, cofunded by FEDER funds, under grant TIN2010-16735. Zeki Bozkus is funded by the Scientific and Technological Research Council of Turkey (TUBITAK; 112E191).
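    To make the embedded-language idea concrete, here is a minimal SAXPY sketch in the style of the published HPL examples. The identifiers (the "HPL.h" header, Array, Float, the predefined idx index variable, and eval) are borrowed from the HPL papers and should be treated as assumptions rather than the definitive API.

        // SAXPY written once in the HPL embedded language; the library
        // captures the kernel body and generates OpenCL for the target device.
        #include "HPL.h"                       // assumed header name
        using namespace HPL;

        void saxpy(Array<float, 1> y, Array<float, 1> x, Float a) {
            y[idx] = a * x[idx] + y[idx];      // idx: predefined global index
        }

        int main() {
            Array<float, 1> x(1000), y(1000);
            for (int i = 0; i < 1000; ++i) {   // host-side access (syntax assumed)
                x(i) = float(i);
                y(i) = 1.0f;
            }
            eval(saxpy)(y, x, 2.0f);           // HPL handles device selection
        }                                      // and host/device data transfers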

    Developing adaptive multi-device applications with the Heterogeneous Programming Library

    Get PDF
    [Abstract] The usage of heterogeneous devices presents two main problems. One is their complex programming, a problem that grows when multiple devices are used. The second is that even if the codes for these devices are portable on top of OpenCL, they lack performance portability, effectively requiring specialized implementations for each device to achieve good performance. In this paper we extend the Heterogeneous Programming Library (HPL), which improves the usability of heterogeneous systems on top of OpenCL, to better handle both issues. First, we provide HPL with mechanisms to support the implementation of any multi-device application that requires arbitrary patterns of communication between several devices and host memory. In a second stage, HPL is improved with an adaptive scheme that optimizes communications between devices depending on the execution environment. An evaluation using benchmarks of very different natures shows that HPL reduces the SLOCs and programming effort of OpenCL applications by 27% and 43%, respectively, while improving the performance of applications that exchange data between devices by 28% on average.
    Funding: Xunta de Galicia (GRC2013/055); Ministerio de Economía y Competitividad (TIN2013-42148-P); Scientific and Technological Research Council of Turkey (TUBITAK; 112E191); European Cooperation in Science and Technology (COST; IC130).
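    The adaptive scheme can be pictured as a one-time calibration per device pair. The sketch below is illustrative only, not HPL's actual interface: it times the two candidate transfer strategies, staging through host memory versus a direct device-to-device copy, and the caller would cache and reuse the winner. via_host and peer are hypothetical stand-ins for the OpenCL-level copies.

        // Illustrative sketch of adaptive path selection between devices.
        #include <chrono>
        #include <functional>

        enum class Path { HostStaged, DirectPeer };

        // Times one execution of a transfer strategy.
        double time_copy(const std::function<void()>& copy) {
            auto t0 = std::chrono::steady_clock::now();
            copy();
            return std::chrono::duration<double>(
                std::chrono::steady_clock::now() - t0).count();
        }

        // One calibration run per device pair; the result is meant to be
        // cached and reused for the rest of the execution.
        Path choose_path(const std::function<void()>& via_host,
                         const std::function<void()>& peer) {
            return time_copy(via_host) <= time_copy(peer) ? Path::HostStaged
                                                          : Path::DirectPeer;
        }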

    Compiling Fortran 90D/HPF for distributed memory MIMD computers

    Get PDF
    This paper describes the design of the Fortran 90D/HPF compiler, a source-to-source parallel compiler for distributed-memory systems being developed at Syracuse University. Fortran 90D/HPF is a data-parallel language with special directives to specify data alignment and distribution. A systematic methodology to process the distribution directives of Fortran 90D/HPF is presented. Furthermore, techniques for data and computation partitioning, communication detection and generation, and the run-time support for the compiler are discussed. Finally, initial performance results for the compiler are presented. We believe that the methodology for processing data distribution, the computation partitioning, the communication system design, and the overall compiler design can be used by the implementors of compilers for HPF.
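    As a hedged illustration of what such a compiler emits, the C++/MPI sketch below hand-writes the node code that would be generated for a BLOCK-distributed computation A(i) = B(i-1) + B(i+1): local bounds derived from the distribution directive, compiler-inserted halo exchanges, and an owner-computes loop. The global size, the halo layout, and the assumption that the process count divides N are illustrative.

        #include <mpi.h>
        #include <vector>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, nprocs;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            const int N = 1024;              // global size (assumed)
            const int local = N / nprocs;    // BLOCK distribution (assumes N % nprocs == 0)
            // +2 halo cells: b[0] and b[local + 1] hold the neighbors' boundary data.
            std::vector<double> b(local + 2, 1.0), a(local, 0.0);

            int left  = rank > 0          ? rank - 1 : MPI_PROC_NULL;
            int right = rank < nprocs - 1 ? rank + 1 : MPI_PROC_NULL;

            // Communication the compiler detects and generates: exchange the
            // boundary elements of B with the neighboring ranks.
            MPI_Sendrecv(&b[local], 1, MPI_DOUBLE, right, 0,
                         &b[0],     1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&b[1],         1, MPI_DOUBLE, left,  1,
                         &b[local + 1], 1, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            // Owner-computes rule: each rank updates only the elements it owns.
            for (int i = 0; i < local; ++i)
                a[i] = b[i] + b[i + 2];      // b is shifted by the halo offset

            MPI_Finalize();
        }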

    Compiling Fortran 90D/HPF for Distributed Memory MIMD Computers

    No full text
    Distributed-memory multiprocessors are increasingly being used to provide high performance for advanced scientific calculations. Distributed-memory machines offer significant advantages over their shared-memory counterparts in terms of cost and scalability, though it is widely accepted that they are difficult to program given the current state of software technology. Currently, distributed-memory machines are programmed using a node language and a message-passing library. This process is tedious and error-prone because the user must perform the tasks of data distribution and communication for non-local data accesses. This thesis describes an advanced compiler that can generate efficient parallel programs when the source programming language naturally represents an application's parallelism. Fortran 90D/HPF, described in this thesis, is such a language. Using Fortran 90D/HPF, parallelism is represented with parallel constructs, such as array operations, where statements, f..
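    For instance, a masked data-parallel construct such as WHERE (mask) A = B + C maps onto a guarded loop over the indices each node owns; the C++ sketch below (with placeholder names for the compiler-generated locals) shows the shape of that node code.

        // Hedged illustration of node code for: WHERE (mask) A = B + C
        void where_add(const bool* mask, double* a,
                       const double* b, const double* c, int lo, int hi) {
            for (int i = lo; i < hi; ++i)        // this node's local index range
                if (mask[i]) a[i] = b[i] + c[i]; // masked element-wise assignment
        }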

    Hybrid MPI plus UPC parallel programming paradigm on an SMP cluster

    No full text
    The symmetric multiprocessing (SMP) cluster system, which consists of shared-memory nodes with several multicore central processing units connected by a high-speed network to form a distributed-memory system, is the most widely available hardware architecture for the high-performance computing community. Today, the Message Passing Interface (MPI) is the most widely used parallel programming paradigm for SMP clusters, with MPI providing the programming model both within an SMP node and across nodes. However, Unified Parallel C (UPC) is an emerging alternative that supports the partitioned global address space (PGAS) model, which can likewise be employed both within and across the nodes of a cluster.
    In this paper, we describe a hybrid parallel programming paradigm designed to combine the MPI and UPC programming models. Its objective is to combine MPI's data-locality control and scalability with UPC's fine-grained parallelism and ease of programming to achieve multiple levels of parallelism on the SMP cluster, which itself has a multilevel parallel architecture. Using the proposed hybrid model, this paper presents a detailed description of a Cannon's algorithm benchmark application, together with performance results for a random-access benchmark and the Barnes-Hut N-body simulation, comparing the hybrid model against MPI-only and UPC-only implementations. Experiments indicate that the hybrid MPI+UPC model can provide performance increases of up to 2x over the UPC-only implementation and up to 20% over the MPI-only implementation. Furthermore, an optimization was applied that improved the hybrid performance by an additional 20%.
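    Since Cannon's algorithm is the paper's main benchmark, the C++/MPI sketch below shows the communication structure of an MPI-only baseline: a sqrt(P) x sqrt(P) torus, an initial skew of the A and B blocks, then q shift-and-multiply steps. Block sizes and contents are placeholders; in the hybrid model the intra-node multiply would instead exploit UPC's shared arrays.

        #include <mpi.h>
        #include <cmath>
        #include <vector>

        // Multiplies two n x n blocks, accumulating into c.
        void local_multiply(const std::vector<double>& a,
                            const std::vector<double>& b,
                            std::vector<double>& c, int n) {
            for (int i = 0; i < n; ++i)
                for (int k = 0; k < n; ++k)
                    for (int j = 0; j < n; ++j)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
        }

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, nprocs;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            int q = static_cast<int>(std::sqrt(nprocs)); // assumes a perfect square
            int dims[2] = {q, q}, periods[2] = {1, 1};   // periodic = torus
            MPI_Comm grid;
            MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
            int coords[2];
            MPI_Cart_coords(grid, rank, 2, coords);

            const int n = 64;                            // local block size (assumed)
            std::vector<double> a(n * n, 1.0), b(n * n, 1.0), c(n * n, 0.0);

            int src, dst;
            // Initial skew: row i of A shifts left by i; column j of B shifts up by j.
            MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
            MPI_Sendrecv_replace(a.data(), n * n, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
            MPI_Sendrecv_replace(b.data(), n * n, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);

            for (int step = 0; step < q; ++step) {
                local_multiply(a, b, c, n);
                // Shift A one step left and B one step up around the torus.
                MPI_Cart_shift(grid, 1, -1, &src, &dst);
                MPI_Sendrecv_replace(a.data(), n * n, MPI_DOUBLE, dst, 0, src, 0,
                                     grid, MPI_STATUS_IGNORE);
                MPI_Cart_shift(grid, 0, -1, &src, &dst);
                MPI_Sendrecv_replace(b.data(), n * n, MPI_DOUBLE, dst, 0, src, 0,
                                     grid, MPI_STATUS_IGNORE);
            }
            MPI_Finalize();
        }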

    Analytical Expense Management System

    No full text
    Although the development of communication technologies (e.g., UMTS, ADSL) has enabled the creation of multi-user Web applications (e.g., information storage), many applications still leave room for improvement and some areas remain uncovered. As a Web application area, expense management systems are still in their infancy. Expense management software is widespread in companies, where it is most often hosted on the corporate intranet. These solutions are quite simple, as they mainly collect the information related to expenses and may propose a simple aggregation of these figures. The result is close to what an Excel sheet provides.

    Benchmarking the Computation and Communication Performance of the CM-5

    Get PDF
    Thinking Machines' CM-5 is a distributed-memory, message-passing computer. In this paper we devise a performance benchmark for the base and vector units and the data communication networks of the CM-5. We model communication characteristics such as the latency and bandwidth of point-to-point and global communication primitives. We show, on a simple Gaussian elimination code, that an accurate static performance estimation of parallel algorithms is possible using these basic machine properties related to computation, vectorization, communication, and synchronization. Furthermore, we describe the embedding of meshes and hypercubes on the CM-5 fat-tree topology and present the performance results of their basic communication primitives.
    This work was supported in part by NSF under CCR-9110812 and by DARPA under contract # DABT63-91-C-0028. This work was also supported in part by a grant of HPC time from the DoD HPC Shared Resource Center, Army Hig..
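    Such point-to-point measurements are typically fit to the linear model T(m) ≈ t_latency + m / bandwidth. As a modern, hedged analogue of this methodology (the original work used the CM-5's own message-passing library rather than MPI), the sketch below runs a two-rank ping-pong over a range of message sizes and reports one-way time and bandwidth.

        #include <mpi.h>
        #include <cstdio>
        #include <vector>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            const int reps = 1000;
            for (int m = 1; m <= (1 << 20); m <<= 2) {   // message size in bytes
                std::vector<char> buf(m);
                MPI_Barrier(MPI_COMM_WORLD);
                double t0 = MPI_Wtime();
                for (int r = 0; r < reps; ++r) {
                    if (rank == 0) {                     // ping
                        MPI_Send(buf.data(), m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                        MPI_Recv(buf.data(), m, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
                    } else if (rank == 1) {              // pong
                        MPI_Recv(buf.data(), m, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
                        MPI_Send(buf.data(), m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                    }
                }
                double t = (MPI_Wtime() - t0) / (2.0 * reps);  // one-way time
                if (rank == 0)
                    std::printf("%8d bytes: %.3f us  %.1f MB/s\n",
                                m, t * 1e6, m / t / 1e6);
            }
            MPI_Finalize();
        }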